Univariate Analysis

Summary: 1. LotArea: The variable exhibits a relatively high standard deviation, indicating a wide spread of values around the mean. The positive kurtosis suggests a heavy-tailed distribution with a relatively large number of outliers or extreme values, and the positive skewness indicates a right-skewed distribution with a longer tail on the right side. Overall, the variable deviates markedly from a normal distribution, with pronounced right skewness and high kurtosis pointing to extreme values in the upper range.

2. YearBuilt: The variable has a relatively low standard deviation of 30, indicating a narrow spread of values around the mean. The negative kurtosis suggests a slightly flatter distribution than normal, and the negative skewness indicates a slight left skew, with a longer tail on the left side.

3. BsmtUnfSF: The variable has a high standard deviation of 442, indicating a wide spread of values around the mean. The positive kurtosis of 0.47 suggests slightly heavier tails than a normal distribution, and the positive skewness of 0.92 indicates a right-skewed distribution with a longer tail on the right side.

4. 1stFlrSF: The variable exhibits a wide range of values with a relatively high standard deviation and positive skewness, indicating a right-skewed distribution with a longer tail on the right side. The high positive kurtosis suggests a significant presence of outliers or extreme values in the upper range.

5. Overall summary: In the plots above, some variables are approximately normally distributed, some have few outliers, and some have many; both left and right skewness appear. Positive kurtosis indicates a more peaked, heavy-tailed distribution, while negative kurtosis indicates a flatter, light-tailed distribution relative to the normal.
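The per-variable statistics quoted above can be reproduced directly with pandas. A minimal sketch on made-up data (the column names follow the dataset, the values do not):

```python
import pandas as pd

# Hypothetical illustration: std, skewness, and kurtosis for numeric columns.
# Column names match the Ames Housing dataset; the values are invented,
# with one extreme LotArea to show right skew and heavy tails.
df = pd.DataFrame({
    "LotArea":   [8450, 9600, 11250, 9550, 14260, 14115, 10084, 215245],
    "YearBuilt": [2003, 1976, 2001, 1915, 2000, 1993, 2004, 1973],
})

for col in df.columns:
    print(
        f"{col}: std={df[col].std():.1f}, "
        f"skew={df[col].skew():.2f}, "     # > 0 means right-skewed
        f"kurtosis={df[col].kurt():.2f}"   # > 0 means heavier tails than normal
    )
```

`Series.skew()` and `Series.kurt()` return sample skewness and excess kurtosis, so 0 is the reference value of a normal distribution for both.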

Summary: MSSubClass: The majority of the dwellings involved in the sale belong to two categories, 20 and 60, which together account for more than 58% of the total MSSubClass values.

MSZoning: The RL type of zoning is the most prevalent, constituting approximately 79% of the observations.

Street: The vast majority, around 99.5%, of the properties have a paved street, while a very small percentage have a gravel street.

Alley: The variable "Alley" represents the type of alley access to the property. It is observed that the number of properties with a gravel alley is slightly higher compared to other types of alley access.

LotShape: The general shape of the properties is primarily regular, accounting for approximately 63% of the observations.

LandContour: The majority of the properties exhibit a flat level, with approximately 90% of the data falling under this category.

Utilities: The "Utilities" variable indicates the type of utilities available. The vast majority, around 99%, have access to all public utilities, including electricity, gas, water, and sewage.

HouseStyle: The analysis reveals that the most common house style is "1Story," which comprises approximately 50% of the observations.

RoofStyle: The "RoofStyle" variable indicates the style of the roof. The analysis shows that "Gable" roofs are the most prevalent, accounting for around 78% of the data, followed by approximately 20% of roofs with a "Hip" style. In summary:

The analysis provides insights into various features of the dataset. It reveals the predominant categories within each variable, such as the most common dwelling type, zoning type, street type, and property shape. These findings help in understanding the distribution and characteristics of the data, which may be useful for further analysis or decision-making related to the dataset.
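The category shares quoted above (e.g. RL at roughly 79%) are straightforward to compute with `value_counts`. A small sketch on invented data:

```python
import pandas as pd

# Hypothetical sketch: percentage share of each category in a column.
# The column name follows the Ames Housing dataset; the rows are made up.
zoning = pd.Series(["RL", "RL", "RM", "RL", "FV", "RL", "RH", "RL"],
                   name="MSZoning")

shares = zoning.value_counts(normalize=True)  # fraction of rows per category
print(shares)
print("most common:", shares.idxmax())
```

`normalize=True` converts raw counts into fractions, so multiplying by 100 gives the percentages cited in the summaries.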

Bivariate Analysis

Visualize the relationship between continuous variables and the target variable

Visualize the relationship between categorical variables and the target variable (SalePrice)

Correlation

Continuous-Continuous Variables

The variable with the highest positive correlation to "SalePrice" is "GrLivArea" with a correlation coefficient of 0.709. Other variables with relatively high positive correlations include "GarageCars" (0.640), "GarageArea" (0.623), "TotalBsmtSF" (0.614), and "1stFlrSF" (0.606). Variables like "FullBath", "TotRmsAbvGrd", "YearBuilt", and "YearRemodAdd" also show moderate positive correlations.
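Ranking features by their Pearson correlation with the target is a one-liner in pandas. A sketch on synthetic data (the column names are from the dataset, the values are not; the coefficients quoted above come from the notebook's own run):

```python
import numpy as np
import pandas as pd

# Synthetic stand-in: SalePrice is driven mostly by GrLivArea here,
# so GrLivArea should rank first, as it does in the real data.
rng = np.random.default_rng(0)
area = rng.uniform(800, 3000, 200)
df = pd.DataFrame({
    "GrLivArea": area,
    "YearBuilt": rng.integers(1900, 2010, 200),
    "SalePrice": area * 100 + rng.normal(0, 20_000, 200),
})

corr = df.corr()["SalePrice"].drop("SalePrice").sort_values(ascending=False)
print(corr)
```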

The range of eta is between 0 and 1. A value closer to 0 indicates that all categories have similar values of y, i.e., no single category has more influence on the variable y. A value closer to 1 indicates that one or more categories have values that differ from the others and therefore have more influence on y.
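Eta here is the correlation ratio between a categorical variable and a numeric one. A minimal implementation (an assumed definition via the between-group/total sum-of-squares ratio, which matches the 0-to-1 behavior described above):

```python
import numpy as np

def correlation_ratio(categories, y):
    """Correlation ratio (eta) between a categorical x and a numeric y.

    eta^2 = between-group sum of squares / total sum of squares,
    so eta is 0 when all category means are equal and approaches 1
    when the category fully determines y.
    """
    y = np.asarray(y, dtype=float)
    categories = np.asarray(categories)
    grand_mean = y.mean()
    ss_total = ((y - grand_mean) ** 2).sum()
    ss_between = sum(
        (categories == c).sum() * (y[categories == c].mean() - grand_mean) ** 2
        for c in np.unique(categories)
    )
    return np.sqrt(ss_between / ss_total) if ss_total > 0 else 0.0

# Category fully determines y -> eta = 1; identical group means -> eta = 0.
print(correlation_ratio(["a", "a", "b", "b"], [1, 1, 5, 5]))  # 1.0
print(correlation_ratio(["a", "b", "a", "b"], [1, 1, 5, 5]))  # 0.0
```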

The summary provides information about the mean, standard deviation (a measure of variability), minimum, quartiles (25th, 50th, and 75th percentiles), and maximum values for each variable in the dataset.
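That five-number summary comes straight from `DataFrame.describe()`; a tiny sketch on invented values:

```python
import pandas as pd

# describe() returns count, mean, std, min, 25%, 50%, 75%, and max per column.
df = pd.DataFrame({"LotArea": [8450, 9600, 11250, 9550, 14260]})
summary = df.describe()
print(summary)
```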

Great: by using regularization we can decrease the error from approximately 0.18 to 0.14.
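To illustrate what regularization does mechanically, here is a minimal sketch of ridge (L2) regression via its closed form on synthetic data. The alpha value and data are invented for illustration; the 0.18 and 0.14 figures above come from the notebook's own runs:

```python
import numpy as np

rng = np.random.default_rng(42)
n, p = 50, 10
X = rng.normal(size=(n, p))
true_w = rng.normal(size=p)
y = X @ true_w + rng.normal(scale=0.5, size=n)

def ridge_fit(X, y, alpha):
    # Closed-form ridge solution: (X^T X + alpha * I)^-1 X^T y
    return np.linalg.solve(X.T @ X + alpha * np.eye(X.shape[1]), X.T @ y)

w_ols = ridge_fit(X, y, 0.0)     # ordinary least squares (no penalty)
w_ridge = ridge_fit(X, y, 10.0)  # L2 penalty shrinks the weights

print("||w_ols|| =", np.linalg.norm(w_ols))
print("||w_ridge|| =", np.linalg.norm(w_ridge))
```

Shrinking the coefficients trades a little bias for lower variance, which is what reduces the held-out error here.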

Linear regression and random forest are both working well.

The lowest error, approximately 0.14, is achieved by random forest, which is better than AdaBoost.

With the help of stacking we can decrease the error further.
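A minimal sketch of why stacking can only help on the data the meta-learner is fit on: a least-squares blend over base-model predictions has each single base model as one candidate solution, so the blend's fit error cannot be worse than the best base model's. The two prediction arrays below are synthetic stand-ins, not the notebook's actual models:

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.normal(size=100)
pred_a = y + rng.normal(scale=0.5, size=100)  # stand-in for e.g. random forest
pred_b = y + rng.normal(scale=0.7, size=100)  # stand-in for e.g. linear model

# Meta-learner: least-squares weights over the base predictions.
P = np.column_stack([pred_a, pred_b])
w, *_ = np.linalg.lstsq(P, y, rcond=None)
stacked = P @ w

rmse = lambda p: np.sqrt(np.mean((y - p) ** 2))
print(rmse(pred_a), rmse(pred_b), rmse(stacked))
```

In practice the meta-learner should be fit on out-of-fold predictions (as `sklearn.ensemble.StackingRegressor` does) to avoid leaking the training targets.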

Plotting the residual curve (is the variance constant, i.e., homoscedastic?)

From the scatter plot above, there is no apparent pattern in the residuals.

Checking whether the residuals are normally distributed

The residuals are approximately normally distributed but contain some outliers.

The QQ-plot above clearly shows that the residuals are approximately normally distributed, with a few outliers.
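A numeric companion to the QQ-plot: for roughly normal residuals, sample skewness and excess kurtosis should both be near 0. The residuals below are a seeded synthetic stand-in, not the model's actual residuals:

```python
import numpy as np

rng = np.random.default_rng(1)
residuals = rng.normal(size=1000)  # stand-in for the fitted model's residuals

# Standardize, then compute the third and fourth standardized moments.
z = (residuals - residuals.mean()) / residuals.std()
skew = np.mean(z ** 3)          # ~0 for a symmetric distribution
excess_kurt = np.mean(z ** 4) - 3  # ~0 for normal tails
print(f"skew={skew:.3f}, excess kurtosis={excess_kurt:.3f}")
```

Large positive excess kurtosis here would flag the same heavy tails that show up as outliers at the ends of the QQ-plot.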

The random forest feature importances indicate that BedroomAbvGr, OverallQual, Utilities, HeatingQC, and Street are the most important features.
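A sketch of how those importances are read off a fitted forest, assuming scikit-learn. The data is synthetic (column names borrowed from the dataset), so the ranking here only reflects the synthetic signal, not the real one:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = pd.DataFrame({
    "OverallQual":  rng.integers(1, 10, 300),
    "BedroomAbvGr": rng.integers(1, 5, 300),
    "Noise":        rng.normal(size=300),    # deliberately uninformative
})
# Synthetic target dominated by OverallQual.
y = 30_000 * X["OverallQual"] + 10_000 * X["BedroomAbvGr"] \
    + rng.normal(scale=5_000, size=300)

rf = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)
importances = pd.Series(rf.feature_importances_, index=X.columns) \
                .sort_values(ascending=False)
print(importances)
```

`feature_importances_` is the impurity-based importance, normalized to sum to 1; permutation importance is a common cross-check because impurity importance can favor high-cardinality features.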

The top features according to Lasso regression are OverallQual, BedroomAbvGr, and Utilities.
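Lasso works as a feature selector because its L1 penalty drives uninformative coefficients to exactly zero. A sketch on synthetic data, assuming scikit-learn (the alpha value is illustrative):

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
# Only features 0 and 1 carry signal; the other three are pure noise.
y = 3.0 * X[:, 0] + 2.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

lasso = Lasso(alpha=0.1).fit(X, y)
print(lasso.coef_)
selected = np.flatnonzero(np.abs(lasso.coef_) > 1e-6)
print("selected feature indices:", selected)
```

Features whose coefficients survive the penalty are the ones "selected"; in the notebook that set is OverallQual, BedroomAbvGr, and Utilities.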

Based on feature importance, we can train our model using only the top features.

Collecting the top features according to feature selection and feature importance from different models

Wonderful! With the help of feature selection, feature importance, stacking, and hyperparameter tuning in an ensemble model, we successfully decreased the number of features from 80 to 8.

The best part is that we get the same RMSLE with these 8 features as we did with all 80.
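For reference, the RMSLE metric quoted throughout can be implemented in a few lines (an assumed implementation; `log1p` avoids `log(0)` for zero-valued targets):

```python
import numpy as np

def rmsle(y_true, y_pred):
    """Root mean squared logarithmic error."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.sqrt(np.mean((np.log1p(y_true) - np.log1p(y_pred)) ** 2))

print(rmsle([100_000, 200_000], [100_000, 200_000]))  # 0.0 for perfect predictions
print(round(rmsle([100_000], [120_000]), 3))
```

Because it compares log-prices, RMSLE penalizes relative errors, so an 0.14 score means roughly similar percentage accuracy on cheap and expensive houses alike.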